(b) All the claims are believed to be directed to a single invention. If the 
Office determines that all the claims presented are not obviously directed to a single 
invention, then Applicants will make an election without traverse as a prerequisite to the 
grant of special status. 

(c) Pre-examination searches were made of U.S. issued patents, including 
a classification search, a computer database search, a keyword search, a literature search, and 
a foreign patent document search. The searches were performed on or around August 19, 
2004, and were conducted by a professional search firm, Kramer & Amado, P.C. The 
classification search covered Classes 711 (subclass 113), 713 (subclasses 310 and 321), and 
714 (subclass 6). The computer database search was conducted on the USPTO systems 
EAST and WEST. The keyword search was conducted in Classes 710 (subclass 5); 711 
(subclasses 112, 114, 154, and 162); 713 (subclasses 300, 320, 322, and 323), and 714 
(subclasses 5 and 7). The literature search was conducted on the Internet. The search for 
foreign patent documents was conducted on the Espacenet and Delphion databases. The 
inventors further provided two references considered most closely related to the subject 
matter of the present application (see references #4 and #5 below), which were cited in the 
Information Disclosure Statement filed with the application on February 11, 2004. 

(d) The following references, copies of which are attached herewith, are 
deemed most closely related to the subject matter encompassed by the claims: 

(1) U.S. Patent No. 5,900,007; 

(2) U.S. Patent No. 5,461,266; 

(3) U.S. Patent No. 5,734,912; 

(4) David A. Patterson et al., "A Case for Redundant Arrays of 
Inexpensive Disks (RAID)"; and 

(5) Japanese Patent Publication No. 2000-293314. 

(e) Set forth below is a detailed discussion of references which points out 
with particularity how the claimed subject matter is distinguishable over the references. 
A. Claimed Embodiments of the Present Invention 

The claimed embodiments relate to an external storage device system and, 
more particularly, to a technology for prolonging an operation period of a disk device 
(hereafter also referred to simply as a disk) and decreasing power consumption of a storage 
device system (hereafter referred to as a disk array). The disk device's operation period 
signifies a period from the time to start using the disk device to the time when the disk device 
becomes unusable. 

Independent claim 1 recites a storage system connected to a computer. The 
storage system comprises a plurality of logical units comprising disk devices. The storage 
system receives an instruction from the computer to turn on or off a disk device 
corresponding to the logical unit. Based on the instruction, the storage system turns on or off 
the disk device corresponding to the logical unit independently of disk devices corresponding 
to the other logical units. 

Independent claim 7 recites a computer system comprising a computer; and a 
storage system. The storage system has a plurality of logical units comprising disk devices. 
The computer provides the storage system with an instruction to turn on or off a disk device 
corresponding to the logical unit. The storage system receives the instruction; and turns on or 
off the disk device corresponding to the logical unit based on the instruction independently of 
disk devices corresponding to the other logical units. 

Independent claim 14 recites a computer program product for a computer 
system comprising a computer and a storage system having a plurality of logical units 
comprising disk devices. The computer program product comprises code for the computer to 
provide the storage system with an instruction to turn on or off a disk device corresponding to 
the logical unit; code for the storage system to receive the instruction; code for the storage 
system to turn on or off a disk device corresponding to the logical unit based on the 
instruction independently of disk devices corresponding to the other logical units; and a 
computer readable storage medium for storing the codes. 

One benefit that may be derived is prolonging operation times of disk devices 
constituting a disk array and decreasing the disk array's power consumption. 
B. Discussion of the References 

None of the following references disclose or suggest a storage system that 
receives an instruction from the computer to turn on or off a disk device corresponding to the 
logical unit and, based on the instruction, turns on or off the disk device corresponding to the 
logical unit independently of disk devices corresponding to the other logical units. 

1. U.S. Patent No. 5,900,007 

This reference discloses a data storage and retrieval system that includes a 
large array of small disk files, and three storage managers for controlling the allocation of 
data to the array, access to data, and the power status of disk files within the array. The 
operation of the data storage system centers around power management subsystem 106, 
which manages disk array 110 such that at any point in time some disk files are active 
(power-on) and others are inactive (power-off), and further such that the disk files which are 
active are those determined to be the best suited to serving the read and write storage requests 
pending in the system at that time. See column 3, line 46 to column 4, line 51. 

2. U.S. Patent No. 5,461,266 

This reference discloses a system for controlling power consumption for an 
information processing apparatus. Reducing the power consumed by the floppy disk drive or 
the hard disk drive is achieved by monitoring the use of the disk drive by means of an 
exclusive CPU and automatically stopping a motor for the drive if there has been no access 
thereto for a given period of time. 

3. U.S. Patent No. 5,734,912 

This reference relates to a power control apparatus for an input/output 
subsystem comprising an input/output control section, which is provided in a disk unit, and 
performs control of data input/output to and from the plurality of disk modules in the same 
unit and issuance, upon power-on, of a power-on instruction in compliance with a 
predetermined procedure. 
4. David A. Patterson et al., "A Case for Redundant Arrays of Inexpensive Disks 
(RAID)" 

This reference discloses a disk array as a type of storage device system 
connected to a computer. The disk array is also referred to as a RAID (Redundant Arrays of 
Inexpensive Disks) and constitutes a storage device system comprising a plurality of disk 
devices arranged in an array and a control section to control them. The disk array 
concurrently operates disk devices to accelerate read requests (requests to read data) and 
write requests (requests to write data) and to provide data with redundancy. Disk arrays are 
categorized into five levels depending on types of redundant data to be added and disk array 
configurations. 

5. Japanese Patent Publication No. 2000-293314 

This reference discloses a technique to suppress the power consumption of a 
magnetic disk drive mounted on a disk array device. The device is provided with a means 
which controls the relation between the configuration of plural magnetic disk drives and 
access from a host device 101, a power-saving controlling means which controls the power- 
saving (selection of power on/off and power-saving mode) of magnetic disk drives in a set 
logical drive, and a controlling means which controls the diagnoses of the magnetic disk 
drives. This disk array device 110 shifts a prescribed magnetic disk drive to a power-saving 
mode or turns off the power (power-saving processing) after access from the device 101 does 
not exist any more and a predetermined time elapses. The magnetic disk drive undergoing 
power-save processing is subjected to diagnosis execution after a prescribed time passes at 
the start of the power-saving processing or when a designated time comes in order to 
maintain its reliability. 
(f) In view of this petition, the Examiner is respectfully requested to issue 
a first Office Action at an early date. 



Respectfully submitted, 




Chun-Pok Leung 
Reg. No. 41,405 
A Case for Redundant Arrays of Inexpensive Disks (RAID) 



David A. Patterson, Garth Gibson, and Randy H. Katz 

Computer Science Division 
Department of Electrical Engineering and Computer Sciences 
571 Evans Hall 
University of California 
Berkeley, CA 94720 
(pattrsn@ginger.berkeley.edu) 



Abstract. Increasing performance of CPUs and memories will be 
squandered if not matched by a similar performance increase in I/O. While 
the capacity of Single Large Expensive Disks (SLED) has grown rapidly, 
the performance improvement of SLED has been modest. Redundant 
Arrays of Inexpensive Disks (RAID), based on the magnetic disk 
technology developed for personal computers, offers an attractive 
alternative to SLED, promising improvements of an order of magnitude in 
performance, reliability, power consumption, and scalability. This paper 
introduces five levels of RAIDs, giving their relative cost/performance, and 
compares RAID to an IBM 3380 and a Fujitsu Super Eagle. 

1. Background: Rising CPU and Memory Performance 

The users of computers are currently enjoying unprecedented growth 
in the speed of computers. Gordon Bell said that between 1974 and 1984, 
single chip computers improved in performance by 40% per year, about 
twice the rate of minicomputers [Bell 84]. In the following year Bill Joy 
predicted an even faster growth [Joy 85]: 

$$\mathrm{MIPS} = 2^{\mathrm{Year} - 1984}$$

Mainframe and supercomputer manufacturers, having difficulty keeping 
pace with the rapid growth predicted by "Joy's Law," cope by offering 
multiprocessors as their top-of-the-line product. 

But a fast CPU does not a fast system make. Gene Amdahl related 
CPU speed to main memory size using this rule [Siewiorek 82]: 

Each CPU instruction per second requires one byte of main memory; 

If computer system costs are not to be dominated by the cost of memory, 
then Amdahl's constant suggests that memory chip capacity should grow 
at the same rate. Gordon Moore predicted that growth rate over 20 years 
ago: 

$$\mathrm{transistors/chip} = 2^{\mathrm{Year} - 1964}$$

As predicted by Moore's Law, RAMs have quadrupled in capacity every 
two [Moore 75] to three years [Myers 86]. 

Recently the ratio of megabytes of main memory to MIPS has been 
defined as alpha [Garcia 84], with Amdahl's constant meaning alpha = 1. In 
part because of the rapid drop of memory prices, main memory sizes have 
grown faster than CPU speeds and many machines are shipped today with 
alphas of 3 or higher. 

To maintain the balance of costs in computer systems, secondary 
storage must match the advances in other parts of the system. A key measure 
of magnetic disk technology is the growth in the maximum number of 
bits that can be stored per square inch, or the bits per inch in a track 
times the number of tracks per inch. Called M.A.D., for maximal areal 
density, the "First Law in Disk Density" predicts [Frank 87]: 

$$\mathrm{MAD} = 10^{(\mathrm{Year} - 1971)/10}$$

Magnetic disk technology has doubled capacity and halved price every three 
years, in line with the growth rate of semiconductor memory, and in 
practice between 1967 and 1979 the disk capacity of the average IBM data 
processing system more than kept up with its main memory [Stevens 81]. 

Capacity is not the only memory characteristic that must grow 
rapidly to maintain system balance, since the speed with which 
instructions and data are delivered to a CPU also determines its ultimate 
performance. The speed of main memory has kept pace for two reasons: 

(1) the invention of caches, showing that a small buffer can be managed 
automatically to contain a substantial fraction of memory references; 

(2) and the SRAM technology, used to build caches, whose speed has 
improved at the rate of 40% to 100% per year. 

In contrast to primary memory technologies, the performance of 
single large expensive magnetic disks (SLED) has improved at a modest 
rate. These mechanical devices are dominated by the seek and the rotation 
delays: from 1971 to 1981, the raw seek time for a high-end IBM disk 
improved by only a factor of two while the rotation time did not 
change [Harker 81]. Greater density means a higher transfer rate when the 
information is found, and extra heads can reduce the average seek time, but 
the raw seek time only improved at a rate of 7% per year. There is no 
reason to expect a faster rate in the near future. 

To maintain balance, computer systems have been using even larger 
main memories or solid state disks to buffer some of the I/O activity. 
This may be a fine solution for applications whose I/O activity has 
locality of reference and for which volatility is not an issue, but 
applications dominated by a high rate of random requests for small pieces 
of data (such as transaction-processing) or by a low number of requests for 
massive amounts of data (such as large simulations running on 
supercomputers) are facing a serious performance limitation. 

2. The Pending I/O Crisis 

What is the impact of improving the performance of some pieces of a 
problem while leaving others the same? Amdahl's answer is now known 
as Amdahl's Law [Amdahl 67]: 

$$S = \frac{1}{(1 - f) + f/k}$$

where 

S = the effective speedup, 
f = fraction of work in faster mode, and 
k = speedup while in faster mode. 

Suppose that some current applications spend 10% of their time in 
I/O. Then when computers are 10X faster--according to Bill Joy in just 
over three years--then Amdahl's Law predicts effective speedup will be only 
5X. When we have computers 100X faster--via evolution of uniprocessors 
or by multiprocessors--this application will be less than 10X faster, 
wasting 90% of the potential speedup. 
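
The arithmetic in this paragraph can be checked with a short Python sketch of Amdahl's Law. The function name `amdahl_speedup` and its parameter names are illustrative choices, not taken from the paper. An application that spends 10% of its time in I/O has 90% of its work in the faster mode.

```python
def amdahl_speedup(fraction_faster: float, speedup_in_faster_mode: float) -> float:
    """Effective speedup S = 1 / ((1 - f) + f / k)."""
    return 1.0 / ((1.0 - fraction_faster) + fraction_faster / speedup_in_faster_mode)


# 10% of the time is I/O, so 90% of the work speeds up with the CPU.
for cpu_speedup in (10, 100):
    effective = amdahl_speedup(0.9, cpu_speedup)
    print(f"CPU {cpu_speedup}X faster -> effective speedup {effective:.2f}X")
```

With a 10X faster CPU the effective speedup is a little over 5X, and with a 100X faster CPU it stays below 10X, which agrees with the figures quoted in the text.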
While we can imagine improvements in software file systems via 
buffering for near term I/O demands, we need innovation to avoid an I/O 
crisis [Boral 83]. 

3. A Solution: Arrays of Inexpensive Disks 

Rapid improvements in capacity of large disks have not been the only 
target of disk designers, since personal computers have created a market for 
inexpensive magnetic disks. These lower cost disks have lower performance 
as well as less capacity. Table I below compares the top-of-the-line 
IBM 3380 model AK4 mainframe disk, Fujitsu M2361A "Super Eagle" 
minicomputer disk, and the Conner Peripherals CP 3100 personal 
computer disk. 

Characteristics                  IBM       Fujitsu   Conners   3380 v.   2361 v. 
                                 3380      M2361A    CP3100    3100      3100 
                                                               (>1 means 3100 is better) 

Disk diameter (inches)           14        10.5      3.5       4         3 
Formatted Data Capacity (MB)     7,500     600       100       .01       .2 
Price/MB (controller incl.)      $18-$10   $20-$17   $10-$7    1-2.5     1.7-3 
MTTF Rated (hours)               30,000    20,000    30,000    1         1.5 
MTTF in practice (hours)         100,000   ?         ?         ?         ? 
No. Actuators                    4         1         1         .2        1 
Maximum I/O's/second/Actuator    50        40        30        .6        .8 
Typical I/O's/second/Actuator    30        24        20        .7        .8 
Maximum I/O's/second/box         200       40        30        .2        .8 
Typical I/O's/second/box         120       24        20        .2        .8 
Transfer Rate (MB/sec)           3         2.5       1         .3        .4 
Power/box (W)                    6,600     640       10        660       64 
Volume (cu. ft.)                 24        3.4       .03       800       110 


Table I. Comparison of IBM 3380 disk model AK4 for mainframe 
computers, the Fujitsu M2361A "Super Eagle" disk for minicomputers, 
and the Conners Peripherals CP 3100 disk for personal computers. By 
"Maximum I/O's/second" we mean the maximum number of average seeks 
and average rotates for a single sector access. Cost and reliability 
information on the 3380 comes from widespread experience [IBM 87] 
[Gawlick 87] and the information on the Fujitsu from the manual [Fujitsu 
87], while some numbers on the new CP3100 are based on speculation. 
The price per megabyte is given as a range to allow for different prices for 
volume discount, and different mark-up practices of the vendors. (The 8 
watt maximum power of the CP 3100 was increased to 10 watts to allow 
for the inefficiency of an external power supply, since the other drives 
contain their own power supplies.) 

One surprising fact is that the number of I/Os per second per actuator in an 
inexpensive disk is within a factor of two of the large disks. In several of 
the remaining metrics, including price per megabyte, the inexpensive disk 
is superior or equal to the large disks. 

The small size and low power are even more impressive since disks 
such as the CP 3100 contain full track buffers and most functions of the 
traditional mainframe controller. Small disk manufacturers can provide 
such functions in high volume disks because of the efforts of standards 
committees in defining higher level peripheral interfaces, such as the ANSI 
X3.131-1986 Small Computer System Interface (SCSI). Such standards 
have encouraged companies like Adaptec to offer SCSI interfaces as single 
chips, in turn allowing disk companies to embed mainframe controller 
functions at low cost. Figure 1 compares the traditional mainframe disk 
approach and the small computer disk approach. The same SCSI interface 
chip embedded as a controller in every disk can also be used as the direct 
memory access (DMA) device at the other end of the SCSI bus. 

Such characteristics lead to our proposal for building I/O systems as 
arrays of inexpensive disks, either interleaved for the large transfers of 
supercomputers [Kim 86][Livny 87][Salem 86] or independent for the many 
small transfers of transaction processing. Using the information in Table 
I, 75 inexpensive disks potentially have 12 times the I/O bandwidth of the 
IBM 3380 and the same capacity, with lower power consumption and cost. 
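
The comparison of 75 inexpensive disks against one IBM 3380 can be reproduced with a few lines of Python. This is a rough sketch under stated assumptions: the `Disk` class and its field names are hypothetical, and the numbers are the capacity, per-box I/O rates, and power figures read from Table I.

```python
from dataclasses import dataclass


@dataclass
class Disk:
    capacity_mb: float
    max_ios_per_sec_box: float
    typical_ios_per_sec_box: float
    power_w: float


# Figures as read from Table I (per box).
ibm_3380 = Disk(capacity_mb=7500, max_ios_per_sec_box=200,
                typical_ios_per_sec_box=120, power_w=6600)
cp3100 = Disk(capacity_mb=100, max_ios_per_sec_box=30,
              typical_ios_per_sec_box=20, power_w=10)

num_small_disks = 75

array_capacity_mb = num_small_disks * cp3100.capacity_mb
max_io_ratio = num_small_disks * cp3100.max_ios_per_sec_box / ibm_3380.max_ios_per_sec_box
typical_io_ratio = (num_small_disks * cp3100.typical_ios_per_sec_box
                    / ibm_3380.typical_ios_per_sec_box)
array_power_w = num_small_disks * cp3100.power_w

print(f"capacity: {array_capacity_mb:.0f} MB vs {ibm_3380.capacity_mb:.0f} MB")
print(f"I/O ratio: {max_io_ratio:.2f}X (maximum), {typical_io_ratio:.2f}X (typical)")
print(f"power: {array_power_w:.0f} W vs {ibm_3380.power_w:.0f} W")
```

The two I/O ratios come out between 11X and 13X, bracketing the "12 times" figure in the text, while capacity is equal and power is lower by nearly an order of magnitude.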

4. Caveats 

We cannot explore all issues associated with such arrays in the space 
available for this paper, so we concentrate on fundamental estimates of 
price-performance and reliability. Our reasoning is that if there are no 
advantages in price-performance or terrible disadvantages in reliability, then 
there is no need to explore further. We characterize a transaction-processing 
workload to evaluate performance of a collection of inexpensive disks, but 
remember that such a collection is just one hardware component of a 
complete transaction-processing system. While designing a complete TPS 
based on these ideas is enticing, we will resist that temptation in this 
paper. Cabling and packaging, certainly an issue in the cost and reliability 
of an array of many inexpensive disks, is also beyond this paper's scope. 

Figure 1. Comparison of organizations for typical mainframe and small 
computer disk interfaces. Single chip SCSI interfaces such as the Adaptec 
AIC-6250 allow the small computer to use a single chip to be the DMA 
interface as well as provide an embedded controller for each disk [Adaptec 
87]. (The price per megabyte in Table I includes everything in the shaded 
boxes above.) 

5. And Now The Bad News: Reliability 

The unreliability of disks forces computer systems managers to make 
backup versions of information quite frequently in case of failure. What 
would be the impact on reliability of having a hundredfold increase in 
disks? Assuming a constant failure rate--that is, an exponentially 
distributed time to failure--and that failures are independent--both 
assumptions made by disk manufacturers when calculating the Mean Time 
To Failure (MTTF)--the reliability of an array of disks is: 

$$\text{MTTF of a Disk Array} = \frac{\text{MTTF of a Single Disk}}{\text{Number of Disks in the Array}}$$



Using the information in Table I, the MTTF of 100 CP 3100 disks is 
30,000/100 = 300 hours, or less than 2 weeks. Compared to the 30,000 
hour (> 3 years) MTTF of the IBM 3380, this is dismal. If we consider 
scaling the array to 1000 disks, then the MTTF is 30 hours or about one 
day, requiring an adjective worse than dismal. 
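
A minimal Python sketch of this calculation follows, assuming independent, exponentially distributed failures as stated above. The function name `array_mttf_hours` is illustrative and not from the paper.

```python
def array_mttf_hours(disk_mttf_hours: float, num_disks: int) -> float:
    """MTTF of an array without redundancy: single-disk MTTF / number of disks."""
    return disk_mttf_hours / num_disks


for num_disks in (100, 1000):
    hours = array_mttf_hours(30_000, num_disks)
    print(f"{num_disks} disks: {hours:.0f} hours ({hours / 24:.1f} days)")
```

For 100 disks this gives 300 hours (12.5 days, under two weeks), and for 1000 disks it gives 30 hours.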

Without fault tolerance, large arrays of inexpensive disks are too 
unreliable to be useful. 

6. A Better Solution: RAID 

To overcome the reliability challenge, we must make use of extra 
disks containing redundant information to recover the original information 
when a disk fails. Our acronym for these Redundant Arrays of Inexpensive 
Disks is RAID. To simplify the explanation of our final proposal and to 
avoid confusion with previous work, we give a taxonomy of five different 
organizations of disk arrays, beginning with mirrored disks and progressing 
through a variety of alternatives with differing performance and reliability. 
We refer to each organization as a RAID level. 

The reader should be forewarned that we describe all levels as if 
implemented in hardware solely to simplify the presentation, for RAID 
ideas are applicable to software implementations as well as hardware. 

Reliability. Our basic approach will be to break the arrays into 
reliability groups, with each group having extra "check" disks containing 
redundant information. When a disk fails we assume that within a short 
time the failed disk will be replaced and the information will be 
reconstructed on to the new disk using the redundant information. This 
time is called the mean time to repair (MTTR). The MTTR can be reduced 
if the system includes extra disks to act as "hot" standby spares; when a 
disk fails, a replacement disk is switched in electronically. Periodically a 
human operator replaces all failed disks. Here are other terms that we use: 

D = total number of disks with data (not including extra check disks), 

G = number of data disks in a group (not including extra check disks), 

C = number of check disks in a group, 

n_G = D/G = number of groups. 

As mentioned above we make the same assumptions that disk 
manufacturers make, that failures are exponential and independent. (An 
earthquake or power surge is a situation where an array of disks might not 
fail independently.) Since these reliability predictions will be very high, 
we want to emphasize that the reliability is only of the disk-head 
assemblies with this failure model, and not the whole software and 
electronic system. In addition, in our view the pace of technology means 
extremely high MTTF are "overkill"--for, independent of expected lifetime, 
users will replace obsolete disks. After all, how many people are still 
using 20 year old disks? 

The general MTTF calculation for single-error repairing RAID is 
given in two steps. First, the group MTTF is 

$$\mathrm{MTTF}_{Group} = \frac{\mathrm{MTTF}_{Disk}}{G+C} \times \frac{1}{\text{Probability of another failure in a group before repairing the dead disk}}$$

As more formally derived in the appendix, the probability of a second 
failure before the first has been repaired is 

$$\text{Probability of Another Failure} = \frac{\mathrm{MTTR}}{\mathrm{MTTF}_{Disk}/(\text{No. Disks} - 1)} = \frac{\mathrm{MTTR}}{\mathrm{MTTF}_{Disk}/(G+C-1)}$$

The intuition behind the formal calculation in the appendix comes 
from trying to calculate the average number of second disk failures during 
the repair time for X single disk failures. Since we assume that disk failures 
occur at a uniform rate, this average number of second failures during the 
repair time for X first failures is 

$$\frac{X \times \mathrm{MTTR}}{\text{MTTF of remaining disks in the group}}$$

The average number of second failures for a single disk is then 

$$\frac{\mathrm{MTTR}}{\mathrm{MTTF}_{Disk} / \text{No. of remaining disks in the group}}$$

The MTTF of the remaining disks is just the MTTF of a single disk 
divided by the number of good disks in the group, giving the result above. 
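The second step is the reliability of the whole system, which is 
approximately (since MTTF_Group is not quite distributed exponentially) 

$$\mathrm{MTTF}_{RAID} = \frac{\mathrm{MTTF}_{Group}}{n_G}$$

Plugging it all together, we get 

$$\mathrm{MTTF}_{RAID} = \frac{\mathrm{MTTF}_{Disk}}{G+C} \times \frac{\mathrm{MTTF}_{Disk}}{(G+C-1) \times \mathrm{MTTR}} \times \frac{1}{n_G}$$

$$= \frac{(\mathrm{MTTF}_{Disk})^2}{(G+C) \times n_G \times (G+C-1) \times \mathrm{MTTR}}$$

$$\mathrm{MTTF}_{RAID} = \frac{(\mathrm{MTTF}_{Disk})^2}{(D + C \times n_G) \times (G+C-1) \times \mathrm{MTTR}}$$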

Since the formula is the same for each level, we make the abstract 
numbers concrete using these parameters as appropriate: D=100 total data 
disks, G=10 data disks per group, MTTF_Disk = 30,000 hours, MTTR = 1 
hour, with the check disks per group C determined by the RAID level. 
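
To make the formula concrete, the Python sketch below evaluates the RAID MTTF as the squared disk MTTF divided by (D + C x n_G) x (G + C - 1) x MTTR, using the parameters just listed. The function name `raid_mttf_hours` and its argument names are illustrative assumptions. The two sample configurations are mirrored disks (G=1, C=1) and a group of 10 data disks with 4 check disks.

```python
def raid_mttf_hours(mttf_disk: float, mttr: float,
                    total_data_disks: int, data_per_group: int,
                    check_per_group: int) -> float:
    """MTTF_RAID = MTTF_Disk^2 / ((D + C*n_G) * (G + C - 1) * MTTR)."""
    num_groups = total_data_disks / data_per_group  # n_G = D / G
    total_disks = total_data_disks + check_per_group * num_groups
    disks_left_in_group = data_per_group + check_per_group - 1
    return mttf_disk ** 2 / (total_disks * disks_left_in_group * mttr)


HOURS_PER_YEAR = 24 * 365

mirrored = raid_mttf_hours(30_000, 1, 100, 1, 1)
grouped = raid_mttf_hours(30_000, 1, 100, 10, 4)
print(f"G=1,  C=1: {mirrored:,.0f} hours ({mirrored / HOURS_PER_YEAR:.0f} years)")
print(f"G=10, C=4: {grouped:,.0f} hours ({grouped / HOURS_PER_YEAR:.0f} years)")
```

The mirrored case evaluates to 4,500,000 hours, which is the figure reported for Level 1 RAID in Table II.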

Reliability Overhead Cost. This is simply the extra check 
disks, expressed as a percentage of the number of data disks D. As we shall 
see below, the cost varies with RAID level from 100% down to 4%. 

Useable Storage Capacity Percentage. Another way to 
express this reliability overhead is in terms of the percentage of the total 
capacity of data disks and check disks that can be used to store data. 
Depending on the organization, this varies from a low of 50% to a high of 
96%. 
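
Both percentages follow directly from the group parameters G and C. The sketch below uses hypothetical helper names. The (G=25, C=1) case is an assumption chosen here because it reproduces the 4% and 96% extremes; the text above does not say which configuration yields them.

```python
def overhead_cost_pct(data_per_group: int, check_per_group: int) -> float:
    """Check disks as a percentage of data disks: 100 * C / G."""
    return 100.0 * check_per_group / data_per_group


def usable_capacity_pct(data_per_group: int, check_per_group: int) -> float:
    """Share of all disks that holds data: 100 * G / (G + C)."""
    return 100.0 * data_per_group / (data_per_group + check_per_group)


# (G, C) pairs: mirrored disks, then an assumed single check disk per 25 data disks.
for g, c in ((1, 1), (25, 1)):
    print(f"G={g}, C={c}: overhead {overhead_cost_pct(g, c):.0f}%, "
          f"usable {usable_capacity_pct(g, c):.0f}%")
```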

Performance. Since supercomputer applications and 
transaction-processing systems have different access patterns and rates, we 
need different metrics to evaluate both. For supercomputers we count the 
number of reads and writes per second for large blocks of data, with large 
defined as getting at least one sector from each data disk in a group. During 
large transfers all the disks in a group act as a single unit, each reading or 
writing a portion of the large data block in parallel. 

A better measure for transaction-processing systems is the number of 
individual reads or writes per second. Since transaction-processing 
systems (e.g., debits/credits) use a read-modify-write sequence of disk 
accesses, we include that metric as well. Ideally during small transfers each 
disk in a group can act independently, either reading or writing independent 
information. In summary supercomputer applications need a high data rate 
while transaction-processing need a high I/O rate. 

For both the large and small transfer calculations we assume the 
minimum user request is a sector, that a sector is small relative to a track, 
and that there is enough work to keep every device busy. Thus sector size 
affects both disk storage efficiency and transfer size. Figure 2 shows the 
ideal operation of large and small disk accesses in a RAID. 




(b) Several Small or Individual Reads and Writes 
(G reads and/or writes spread over G disks) 

Figure 2. Large transfer vs. small transfers in a group of G disks. 

The six performance metrics are then the number of reads, writes, and 
read-modify-writes per second for both large (grouped) or small (individual) 
transfers. Rather than give absolute numbers for each metric, we calculate 
efficiency: the number of events per second for a RAID relative to the 
corresponding events per second for a single disk. (This is Boral's I/O 
bandwidth per gigabyte [Boral 83] scaled to gigabytes per disk.) In this 
paper we are after fundamental differences so we use simple, deterministic 
throughput measures for our performance metric rather than latency. 

Effective Performance Per Disk. The cost of disks can be a 
large portion of the cost of a database system, so the I/O performance per 
disk--factoring in the overhead of the check disks--suggests the 
cost/performance of a system. This is the bottom line for a RAID. 

7. First Level RAID: Mirrored Disks 

Mirrored disks are a traditional approach for improving reliability of 
magnetic disks. This is the most expensive option we consider since all 
disks are duplicated (G=1 and C=1), and every write to a data disk is also a 
write to a check disk. Tandem doubles the number of controllers for fault 
tolerance, allowing an optimized version of mirrored disks that lets reads 
occur in parallel. Table II shows the metrics for a Level 1 RAID assuming 
this optimization. 



MTTF                           Exceeds Useful Product Lifetime 
                               (4,500,000 hrs or > 500 years) 
Total Number of Disks          2D 
Overhead Cost                  100% 
Useable Storage Capacity       50% 

Events/Sec vs. Single Disk     Full RAID    Efficiency Per Disk 
Large (or Grouped) Reads       2D/S         1.00/S 
Large (or Grouped) Writes      D/S          .50/S 
Large (or Grouped) R-M-W       4D/3S        .67/S 
Small (or Individual) Reads    2D           1.00 
Small (or Individual) Writes   D            .50 
Small (or Individual) R-M-W    4D/3         .67 



Table II. Characteristics of Level 1 RAID. Here we assume that writes 
are not slowed by waiting for the second write to complete because the 
slowdown for writing 2 disks is minor compared to the slowdown S for 
writing a whole group of 10 to 25 disks. Unlike a "pure" mirrored scheme 
with extra disks that are invisible to the software, we assume an optimized 
scheme with twice as many controllers allowing parallel reads to all disks, 
giving full disk bandwidth for large reads and allowing the reads of 
read-modify-writes to occur in parallel. 

When individual accesses are distributed across multiple disks, average 
queueing, seek, and rotate delays may differ from the single disk case. 
Although bandwidth may be unchanged, it is distributed more evenly, 
reducing variance in queueing delay and, if the disk load is not too high, 
also reducing the expected queueing delay through parallelism [Livny 87]. 
When many arms seek to the same track then rotate to the described sector, 
the average seek and rotate time will be larger than the average for a single 
disk, tending toward the worst case times. This effect should not generally 
more than double the average access time to a single sector while still 
getting many sectors in parallel. In the special case of mirrored disks with 
sufficient controllers, the choice between arms that can read any data sector 
will reduce the time for the average read seek by up to 45% [Bitton 88]. 

To allow for these factors but to retain our fundamental emphasis we 
apply a slowdown factor, S, when there are more than two disks in a 
group. In general, 1 <= S <= 2 whenever groups of disks work in parallel. 
With synchronous disks the spindles of all disks in the group are 
synchronous so that the corresponding sectors of a group of disks pass 
under the heads simultaneously [Kurzweil 88], so for synchronous disks 
there is no slowdown and S = 1. Since a Level 1 RAID has only one data 
disk in its group, we assume that the large transfer requires the same 
number of disks acting in concert as found in groups of the higher level 
RAIDs: 10 to 25 disks. 

Duplicating all disks can mean doubling the cost of the database 
system or using only 50% of the disk storage capacity. Such largess 
inspires the next levels of RAID. 

8. Second Level RAID: Hamming Code for ECC 

The history of main memory organizations suggests a way to reduce 
the cost of reliability. With the introduction of 4K and 16K DRAMs, 
computer designers discovered that these new devices were subject to 
losing information due to alpha particles. Since there were many single 
bit DRAMs in a system and since they were usually accessed in groups of 
16 to 64 chips at a time, system designers added redundant chips to correct 
single errors and to detect double errors in each group. This increased the 
number of memory chips by 12% to 38%--depending on the size of the 
group--but it significantly improved reliability. 

As long as all the data bits in a group are read or written together, 
there is no impact on performance. However, reads of less than the group 
size require reading the whole group to be sure the information is correct, 
and writes to a portion of the group mean three steps: 

1) a read step to get all the rest of the data; 

2) a modify step to merge the new and old information; 

3) a write step to write the full group, including check information. 
Since we have scores of disks in a RAID and since some accesses are 

to groups of disks, we can mimic the DRAM solution by bit-interleaving 
the data across the disks of a group and then add enough check disks to 
detect and correct a single error A single parity disk can detect a single 
error, but to correct an error we need enough check disks to identify the 
disk with the error For a group size of 10 data disks (G) we need 4 check 
disks (C) in total, and if G « 25 then C = 5 [HammingSO) To keep down 
the cost of redundancy, we assume the group size will vary from 10 to 25 
Since our individual data transfer unit is just a sector, bit* interleaved 
disks mean that a large transfer for this RAID must be at least G sectors 
Like DRAMs, reads to a smaller amount implies reading a full rector from 
each of the bit- interleaved disks in a group, and writes of a single unit 
involve the read-mod if y-wnte cycle to all the disks Table III shows the 
metrics of this Level 2 RAID 
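The check-disk counts quoted above (C = 4 for G = 10, C = 5 for G = 25) can be reproduced with a short sketch. It assumes the standard single-error-correcting Hamming condition 2^C ≥ G + C + 1, which the text does not state explicitly but only cites via [Hamming 50]; the function names are illustrative.

```python
def hamming_check_disks(data_disks):
    """Smallest C with 2**C >= data_disks + C + 1.

    Assumption: this is the usual single-error-correcting Hamming bound;
    the paper only cites [Hamming 50] for the resulting counts.
    """
    c = 1
    while 2 ** c < data_disks + c + 1:
        c += 1
    return c


def overhead_and_capacity(data_disks, check_disks):
    """Return (overhead cost, useable storage fraction) for one group."""
    overhead = check_disks / data_disks
    usable = data_disks / (data_disks + check_disks)
    return overhead, usable


for g in (10, 25):
    c = hamming_check_disks(g)
    overhead, usable = overhead_and_capacity(g, c)
    print(f"G={g}: C={c}, overhead={overhead:.0%}, usable={usable:.0%}")
```

Under that assumption the sketch gives 40% overhead and 71% useable capacity for G = 10, and 20% and 83% for G = 25, the same figures as the overhead and capacity rows of Table III.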
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or 12 years) 


                            G=10                     G=25

Total Number of Disks       1.40D                    1.20D
Overhead Cost               40%                      20%
Useable Storage Capacity    71%                      83%

Events/Sec       Full RAID  Efficiency Per Disk      Efficiency Per Disk
(vs Single Disk)            L2       L2/L1           L2       L2/L1
Large Reads      D/S        .71/S    71%             .86/S    86%
Large Writes     D/S        .71/S    143%            .86/S    172%
Large R-M-W      D/S        .71/S    107%            .86/S    129%
Small Reads      D/SG       .07/S    6%              .03/S    3%
Small Writes     D/2SG      .04/S    6%              .02/S    3%
Small R-M-W      D/SG       .07/S    9%              .03/S    4%


Table III. Characteristics of a Level 2 RAID. The L2/L1 column gives
the % performance of level 2 in terms of level 1 (>100% means L2 is
faster). As long as the transfer unit is large enough to spread over all the
data disks of a group, the large I/Os get the full bandwidth of each disk,
divided by S to allow all disks in a group to complete. Level 1 large reads
are faster because data is duplicated and so the redundancy disks can also do
independent accesses. Small I/Os still require accessing all the disks in a
group, so only D/G small I/Os can happen at a time, again divided by S to
allow a group of disks to finish. Small Level 2 writes are like small
R-M-W because full sectors must be read before new data can be written
onto part of each sector.

For large writes, the level 2 system has the same performance as level
1 even though it uses fewer check disks, and so on a per disk basis it
outperforms level 1. For small data transfers the performance is dismal
either for the whole system or per disk; all the disks of a group must be
accessed for a small transfer, limiting the maximum number of
simultaneous accesses to D/G. We also include the slowdown factor S
since the access must wait for all the disks to complete.
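The per-disk figures discussed above follow from dividing each full-RAID event rate by the total number of disks. A minimal sketch of that arithmetic, using the formulas from the Full RAID column of Table III (the function names and the choice D = 100 are illustrative assumptions, and S is set to 1 so results read as multiples of 1/S):

```python
def level2_events_per_sec(d, g, s):
    """Full-RAID event rates relative to one disk, per the Table III formulas."""
    return {
        "large_read": d / s,
        "large_write": d / s,
        "large_rmw": d / s,
        "small_read": d / (s * g),         # only D/G groups can work at once
        "small_write": d / (2 * s * g),    # read then write of the whole group
        "small_rmw": d / (s * g),
    }


def per_disk_efficiency(rates, d, g, c):
    """Divide each rate by the total disk count D * (1 + C/G)."""
    total_disks = d * (1 + c / g)
    return {name: rate / total_disks for name, rate in rates.items()}


# Illustrative values: D = 100 data disks, G = 10, C = 4, S = 1.
eff = per_disk_efficiency(level2_events_per_sec(100, 10, 1), 100, 10, 4)
for name, value in eff.items():
    print(f"{name}: {value:.2f}/S")
```

For G = 10 this gives about .71/S for large transfers, .07/S for small reads, and .04/S for small writes, in line with the G = 10 column of Table III.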

Thus level 2 RAID is desirable for supercomputers but inappropriate
for transaction processing systems, with increasing group size increasing
the disparity in performance per disk for the two applications. In
recognition of this fact, Thinking Machines Incorporated announced a
Level 2 RAID this year for its Connection Machine supercomputer called
the "Data Vault," with G = 32 and C = 8, including one hot standby spare
[Hillis 87].

Before improving small data transfers, we concentrate once more on
lowering the cost.

9. Third Level RAID: Single Check Disk Per Group

Most check disks in the level 2 RAID are used to determine which
disk failed, for only one redundant parity disk is needed to detect an error.
These extra disks are truly "redundant" since most disk controllers can
already detect if a disk failed, either through special signals provided in the
disk interface or the extra checking information at the end of a sector used
to detect and correct soft errors. So information on the failed disk can be
reconstructed by calculating the parity of the remaining good disks and
then comparing bit-by-bit to the parity calculated for the original full
group. When these two parities agree, the failed bit was a 0; otherwise it
was a 1. If the check disk is the failure, just read all the data disks and store
the group parity in the replacement disk.
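The reconstruction rule above (agreeing parities mean the lost bit was 0, disagreeing parities mean it was 1) is an exclusive OR of the surviving disks' parity with the stored parity. A minimal sketch, assuming each disk's contents are modeled as a short byte string; the helper names and sample data are hypothetical:

```python
from functools import reduce


def parity(blocks):
    """Bytewise XOR of equal-length blocks (the group parity)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def reconstruct(surviving_data, stored_parity):
    """Rebuild the failed data disk from the good disks and the check disk.

    Where the parity of the survivors agrees with the stored parity the
    lost bit was 0, otherwise 1, which is exactly a bitwise XOR.
    """
    partial = parity(surviving_data)
    return bytes(p ^ q for p, q in zip(partial, stored_parity))


# Hypothetical group of G = 4 data disks holding two bytes each.
data = [b"\x0f\xa0", b"\x33\x01", b"\xc4\x7e", b"\x55\x10"]
check = parity(data)

lost = data[2]                                    # pretend disk 3 failed
rebuilt = reconstruct(data[:2] + data[3:], check)
print(rebuilt == lost)
```

If the check disk itself fails, calling `parity(data)` over all data disks regenerates its contents, matching the last sentence of the paragraph.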

Reducing the check disks to one per group (C=1) reduces the overhead
cost to between 4% and 10% for the group sizes considered here. The
performance for the third level RAID system is the same as the Level 2
RAID, but the effective performance per disk increases since it needs fewer
check disks. This reduction in total disks also increases reliability, but
since it is still larger than the useful lifetime of disks, this is a minor
point. One advantage of a level 2 system over level 3 is that the extra
check information associated with each sector to correct soft errors is not
needed, increasing the capacity per disk by perhaps 10%. Level 2 also
allows all soft errors to be corrected "on the fly" without having to reread a
sector. Table IV summarizes the third level RAID characteristics and
Figure 3 compares the sector layout and check disks for levels 2 and 3.

MTTF                        Exceeds Useful Lifetime
                            G=10                     G=25
                            (820,000 hrs             (346,000 hrs
                            or >90 years)            or 40 years)
Total Number of Disks       1.10D                    1.04D
Overhead Cost               10%                      4%
Useable Storage Capacity    91%                      96%

Events/Sec       Full RAID  Efficiency Per Disk      Efficiency Per Disk
(vs Single Disk)            L3     L3/L2  L3/L1      L3     L3/L2  L3/L1
Large Reads      D/S        .91/S  127%   91%        .96/S  112%   96%
Large Writes     D/S        .91/S  127%   182%       .96/S  112%   192%
Large R-M-W      D/S        .91/S  127%   136%       .96/S  112%   142%
Small Reads      D/SG       .09/S  127%   8%         .04/S  112%   3%
Small Writes     D/2SG      .05/S  127%   8%         .02/S  112%   3%
Small R-M-W      D/SG       .09/S  127%   11%        .04/S  112%   5%


Table IV. Characteristics of a Level 3 RAID. The L3/L2 column gives
the % performance of L3 in terms of L2 and the L3/L1 column gives it in
terms of L1 (>100% means L3 is faster). The performance for the full
systems is the same in RAID levels 2 and 3, but since there are fewer
check disks the performance per disk improves.

Park and Balasubramanian proposed a third level RAID system
without suggesting a particular application [Park 86]. Our calculations
suggest it is a much better match to supercomputer applications than to
transaction processing systems. This year two disk manufacturers have
announced level 3 RAIDs for such applications using synchronized 5.25
inch disks with G=4 and C=1: one from Maxtor and one from Micropolis
[Maginnis 87].

This third level has brought the reliability overhead cost to its lowest
level, so in the last two levels we improve performance of small accesses
without changing cost or reliability.

10. Fourth Level RAID: Independent Reads/Writes

Spreading a transfer across all disks within the group has the
following advantage:

• Large or grouped transfer time is reduced because transfer
bandwidth of the entire array can be exploited.

But it has the following disadvantages as well:

• Reading/writing to a disk in a group requires reading/writing to
all the disks in a group; levels 2 and 3 RAIDs can perform only
one I/O at a time per group.

• If the disks are not synchronized, you do not see average seek
and rotational delays; the observed delays should move towards
the worst case, hence the S factor in the equations above.

This fourth level RAID improves performance of small transfers through
parallelism, the ability to do more than one I/O per group at a time. We
no longer spread the individual transfer information across several disks,
but keep each individual unit in a single disk.

The virtue of bit-interleaving is the easy calculation of the Hamming
code needed to detect or correct errors in level 2. But recall that in the third
level RAID we rely on the disk controller to detect errors within a single
disk sector. Hence, if we store an individual transfer unit in a single sector,
we can detect errors on an individual read without accessing any other disk.
Figure 3 shows the different ways the information is stored in a sector for
RAID levels 2, 3, and 4. By storing a whole transfer unit in a sector, reads
can be independent and operate at the maximum rate of a disk yet still
detect errors. Thus the primary change between level 3 and 4 is that we
interleave data between disks at the sector level rather than at the bit level.

Figure 3. Comparison of location of data and check information in
sectors for RAID levels 2, 3, and 4 for G=4. Not shown is the small
amount of check information per sector added by the disk controller to
detect and correct soft errors within a sector. Remember that we use
physical sector numbers and hardware control to explain these ideas, but
RAID can be implemented by software using logical sectors and disks.
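The difference between bit-level and sector-level interleaving can be illustrated with a small sketch. The round-robin placement used here is an assumption made for illustration, not the exact layout of Figure 3, and both function names are hypothetical.

```python
def bit_interleave(unit, group_size):
    """Levels 2 and 3: deal the bits of ONE transfer unit round-robin
    over the data disks, so every disk holds part of the unit."""
    bits = [(byte >> (7 - i)) & 1 for byte in unit for i in range(8)]
    return [bits[disk::group_size] for disk in range(group_size)]


def sector_interleave(units, group_size):
    """Level 4: deal WHOLE transfer units (one per sector) round-robin
    over the data disks, so each unit lives on a single disk."""
    return [units[disk::group_size] for disk in range(group_size)]


G = 4

# Reading one bit-interleaved unit touches every data disk in the group.
per_disk_bits = bit_interleave(b"\xa5", G)
disks_needed_bit = sum(1 for bits in per_disk_bits if bits)

# Reading one sector-interleaved unit touches exactly one disk.
per_disk_units = sector_interleave([b"u0", b"u1", b"u2", b"u3", b"u4"], G)
disks_needed_sector = sum(1 for units in per_disk_units if b"u1" in units)

print(disks_needed_bit, disks_needed_sector)
```

With G = 4, the bit-interleaved read needs all four data disks while the sector-interleaved read needs one, which is why level 4 reads within a group can proceed independently.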

At first thought you might expect that an individual write to a single
sector still involves all the disks in a group since (1) the check disk must
be rewritten with the new parity data, and (2) the rest of the data disks
must be read to be able to calculate the new parity data. Recall that each
parity bit is just a single exclusive OR of all the corresponding data bits in
a group. In level 4 RAID, unlike level 3, the parity calculation is much
simpler since, if we know the old data value and the old parity value as
well as the new data value, we can calculate the new parity information as
follows:

    new parity = (old data xor new data) xor old parity

In level 4 a small write then uses 2 disks to perform 4 accesses (2 reads
and 2 writes) while a small read involves only one read on one disk. Table
V summarizes the fourth level RAID characteristics. Note that all small
accesses improve, dramatically for the reads, but the small
read-modify-write is still so slow relative to a level 1 RAID that its
applicability to transaction processing is doubtful. Recently Salem and
Garcia-Molina proposed a Level 4 system [Salem 86].
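The parity update formula can be checked against a full recomputation of the group parity. The sketch below models sectors as short byte strings; the helper names and sample values are illustrative assumptions.

```python
from functools import reduce


def xor_bytes(a, b):
    """Bytewise exclusive OR of two equal-length sectors."""
    return bytes(x ^ y for x, y in zip(a, b))


def full_parity(data_sectors):
    """Parity computed the expensive way: XOR of every data sector."""
    return reduce(xor_bytes, data_sectors)


def small_write_parity(old_data, new_data, old_parity):
    """new parity = (old data xor new data) xor old parity."""
    return xor_bytes(xor_bytes(old_data, new_data), old_parity)


# Hypothetical group of G = 4 data sectors, two bytes each.
group = [b"\x12\x34", b"\xab\xcd", b"\x00\xff", b"\x7e\x81"]
old_parity = full_parity(group)

# Overwrite the sector on the second data disk only.
new_sector = b"\x55\xaa"
updated = small_write_parity(group[1], new_sector, old_parity)
group[1] = new_sector

print(updated == full_parity(group))
```

Only the target data disk and the check disk are read and written, which is the 2-disk, 4-access small write described above.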

Before proceeding to the next level we need to explain the
performance of small writes in Table V (and hence small
read-modify-writes, since they entail the same operations in this RAID).
The formula for the small writes divides D by 2 instead of 4 because 2
accesses can proceed in parallel: the old data and old parity can be read at
the same time and the new data and new parity can be written at the same
time. The performance of small writes is also divided by G because the
single check disk in a group must be read and written with every small
write in that group, thereby limiting the number of writes that can be
performed at a time to the number of groups.

The check disk is the bottleneck, and the final level RAID removes
this bottleneck.

(Table V body; only the following entries survive in this copy)

                  .96/S   100%
Large Writes
Large R-M-W       D/S
Small Writes      D/2G    .05   120%   9%    .02   120%   4%
Small R-M-W       D/G     ...   6%




Table V. Characteristics of a Level 4 RAID. The L4/L3 column gives
the % performance of L4 in terms of L3 and the L4/L1 column gives it in
terms of L1 (>100% means L4 is faster). Small reads improve because
they no longer tie up a whole group at a time. Small writes and R-M-Ws
improve some because we make the same assumptions as we made in
Table II: the slowdown for two related I/Os can be ignored because only
two disks are involved.

11. Fifth Level RAID: No Single Check Disk

While level 4 RAID achieved parallelism for reads, writes are still
limited to one per group since every write must read and write the check
disk. The final level RAID distributes the data and check information
across all the disks, including the check disks. Figure 4 compares the
location of check information in the sectors of disks for levels 4 and 5
RAIDs.

The performance impact of this small change is large since RAID
level 5 can support multiple individual writes per group. For example,
suppose in Figure 4 above we want to write sector 0 of disk 2 and sector 1
of disk 3. As shown on the left of Figure 4, in RAID level 4 these writes
must be sequential since both sector 0 and sector 1 of disk 5 must be
written. However, as shown on the right, in RAID level 5 the writes can
proceed in parallel since a write to sector 0 of disk 2 still involves a write
to disk 5 but a write to sector 1 of disk 3 involves a write to disk 4.
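The example above can be sketched by modeling which disk holds the check information for each sector. The text fixes only two placements for level 5 (sector 0 on disk 5, sector 1 on disk 4); the rotation used below for the remaining sectors is an assumption, as are the function names.

```python
def parity_disk_level4(sector, total_disks=5):
    """Level 4: the check information always lives on the last disk."""
    return total_disks


def parity_disk_level5(sector, total_disks=5):
    """Level 5: check information rotates across the disks.

    Assumed rotation: sector 0 -> disk 5, sector 1 -> disk 4, and so on,
    which matches the two placements given in the text.
    """
    return total_disks - (sector % total_disks)


def disks_used_by_write(data_disk, sector, parity_fn):
    """A small write touches the data disk and that sector's check disk."""
    return {data_disk, parity_fn(sector)}


def can_run_in_parallel(write_a, write_b, parity_fn):
    """Two small writes can overlap only if they share no disk."""
    used_a = disks_used_by_write(*write_a, parity_fn)
    used_b = disks_used_by_write(*write_b, parity_fn)
    return used_a.isdisjoint(used_b)


write_1 = (2, 0)   # sector 0 of disk 2
write_2 = (3, 1)   # sector 1 of disk 3

print(can_run_in_parallel(write_1, write_2, parity_disk_level4))  # disk 5 shared
print(can_run_in_parallel(write_1, write_2, parity_disk_level5))  # disks 5 and 4
```

Under level 4 both writes need disk 5 and must be serialized; under the assumed level 5 rotation they touch disjoint disk pairs and can proceed together.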

These changes bring RAID level 5 near the best of both worlds: small
read-modify-writes now perform close to the speed per disk of a level 1
RAID while keeping the large transfer performance per disk and high
useful storage capacity percentage of the RAID levels 3 and 4. Spreading
the data across all disks even improves the performance of small reads,
since there is one more disk per group that contains data. Table VI
summarizes the characteristics of this RAID.

Keeping in mind the caveats given earlier, a Level 5 RAID appears
very attractive if you want to do just supercomputer applications, or just
transaction processing when storage capacity is limited, or if you want to
do both supercomputer applications and transaction processing.

12. Discussion

Before concluding the paper, we wish to note a few more interesting
points about RAIDs. The first is that while the schemes for disk striping
and parity support were presented as if they were done by hardware, there is
no necessity to do so. We just give the method, and the decision between
hardware and software solutions is strictly one of cost and benefit. For
example, in cases where disk buffering is effective, there are no extra disk
reads for level 5 small writes since the old data and old parity would be in
main memory, so software would give the best performance as well as the
least cost.

(a) Check information for Level 4 RAID for G=4 and C=1.

(b) Check information for Level 5 RAID for G=4 and C=1. The sectors
are shown below the disks, with the check information and data spread
evenly through all the disks. Writes to s0 of disk 2 and s1 of disk 3 still
imply 2 writes, but they can be split across 2 disks: to s0 of disk 5 and
to s1 of disk 4.

Figure 4. Location of check information per sector for Level 4 RAID
vs. Level 5 RAID.

Table VI. Characteristics of a Level 5 RAID. The L5/L4 column gives
the % performance of L5 in terms of L4 and the L5/L1 column gives it in
terms of L1 (>100% means L5 is faster). Because reads can be spread over
all disks, including what were check disks in level 4, all small I/Os
improve by a factor of 1+C/G. Small writes and R-M-Ws improve because
they are no longer constrained by group size, getting the full disk
bandwidth for the 4 I/Os associated with these accesses. We again make
the same assumptions as we made in Tables II and V: the slowdown for
two related I/Os can be ignored because only two disks are involved.

In this paper we have assumed the transfer unit is a multiple of the
sector. As the size of the smallest transfer unit grows larger than one
sector per drive, such as a full track with an I/O protocol that supports data
returned out-of-order, then the performance of RAIDs improves
significantly because of the full track buffer in every disk. For example, if
every disk begins transferring to its buffer as soon as it reaches the next
sector, then S may reduce to less than 1 since there would be virtually no
rotational delay. With transfer units the size of a track, it is not even clear
if synchronizing the disks in a group improves RAID performance.

This paper makes two separable points: the advantages of building
I/O systems from personal computer disks and the advantages of five
different disk array organizations, independent of disks used in those arrays.
The latter point starts with the traditional mirrored disks to achieve
acceptable reliability, with each succeeding level improving:

• the data rate, characterized by a small number of requests per second
for massive amounts of sequential information (supercomputer
applications),

• the I/O rate, characterized by a large number of read-modify-writes to
a small amount of random information (transaction-processing),

• or the useable storage capacity,

or possibly all three.

Figure 5 shows the performance improvements per disk for each level
RAID. The highest performance per disk comes from either Level 1 or
Level 5. In transaction-processing situations using no more than 50% of
storage capacity, then the choice is mirrored disks (Level 1). However, if
the situation calls for using more than 50% of storage capacity, or for
supercomputer applications, or for combined supercomputer applications
and transaction processing, then Level 5 looks best. Both the strength and
weakness of Level 1 is that it duplicates data rather than calculating check
information, for the duplicated data improves read performance but lowers
capacity and write performance, while check data is useful only on a failure.

Inspired by the space-time product of paging studies [Denning 78], we
propose a single figure of merit called the space-speed product: the useable
storage fraction times the efficiency per event. Using this metric, Level 5
has an advantage over Level 1 of 1.7 for reads and 3.3 for writes for G=10.

Let us return to the first point, the advantages of building I/O systems
from personal computer disks. Compared to traditional Single Large
Expensive Disks (SLED), Redundant Arrays of Inexpensive Disks (RAID)
offer significant advantages for the same cost. Table VII compares a level 5
RAID using 100 inexpensive data disks with a group size of 10 to the
IBM 3380. As you can see, a level 5 RAID offers a factor of roughly 10
improvement in performance, reliability, and power consumption (and
hence air conditioning costs) and a factor of 3 reduction in size over this
SLED. Table VII also compares a level 5 RAID using 10 inexpensive data
disks with a group size of 10 to a Fujitsu M2361A "Super Eagle." In this
comparison RAID offers roughly a factor of 5 improvement in
performance, power consumption, and size with more than two orders of
magnitude improvement in (calculated) reliability.

RAID offers the further advantage of modular growth over SLED.
Rather than being limited to 7,500 MB per increase for $100,000 as in
the case of this model of IBM disk, RAIDs can grow at either the group
size (1000 MB for $11,000) or, if partial groups are allowed, at the disk
size (100 MB for $1,100). The flip side of the coin is that RAID also
makes sense in systems considerably smaller than a SLED. Small
incremental costs also make hot standby spares practical to further reduce
MTTR and thereby increase the MTTF of a large system. For example, a
1000 disk level 5 RAID with a group size of 10 and a few standby spares
could have a calculated MTTF of over 45 years.

A final comment concerns the prospect of designing a complete
transaction processing system from either a Level 1 or Level 5 RAID. The
drastically lower power per megabyte of inexpensive disks allows systems
designers to consider battery backup for the whole disk array; the power
needed for 110 PC disks is less than two Fujitsu Super Eagles. Another
approach would be to use a few such disks to save the contents of battery
backed-up main memory in the event of an extended power failure. The
smaller capacity of these disks also ties up less of the database during
reconstruction, leading to higher availability. (Note that Level 5 ties up
all the disks in a group in event of failure while Level 1 only needs the
single mirrored disk during reconstruction, giving Level 1 the edge in
availability.)

13. Conclusion

RAIDs offer a cost effective option to meet the challenge of
exponential growth in the processor and memory speeds. We believe the
size reduction of personal computer disks is a key to the success of disk
arrays, just as Gordon Bell argues that the size reduction of
microprocessors is a key to the success in multiprocessors [Bell 85]. In
both cases the smaller size simplifies the interconnection of the many
components as well as packaging and cabling. While large arrays of
mainframe processors (or SLEDs) are possible, it is certainly easier to
construct an array from the same number of microprocessors (or PC
drives). Just as Bell coined the term "multi" to distinguish a
multiprocessor made from microprocessors, we use the term "RAID" to
identify a disk array made from personal computer disks.

With advantages in cost-performance, reliability, power consumption,
and modular growth, we expect RAIDs to replace SLEDs in future I/O
systems. There are, however, several open issues that may bear on the
practicality of RAIDs:

• What is the impact of a RAID on latency? 

• What is the impact on MTTF calculations of non-exponential failure 
assumptions for individual disks? 

• What will be the real lifetime of a RAID vs calculated MTTF using the 
independent failure model? 

• How would synchronized disks affect level 4 and 5 RAID performance? 

• How does "slowdown" S actually behave? [Livny 87]

• How do defective sectors affect RAID? 

• How do you schedule I/O to level 5 RAIDs to maximize write 
parallelism? 

• Is there locality of reference of disk accesses in transaction processing?

• Can information be automatically redistributed over 100 to 1000 disks
to reduce contention? 

• Will disk controller design limit RAID performance? 

• How should 100 to 1000 disks be constructed and physically connected 
to the processor? 

• What is the impact of cabling on cost, performance, and reliability? 

• Where should a RAID be connected to a CPU so as not to limit 
performance? Memory bus? I/O bus? Cache? 

• Can a file system allow different striping policies for different files?

• What is the role of solid state disks and WORMs in a RAID? 

• What is the impact on RAID of "parallel access" disks (access to every
surface under the read/write head in parallel)?

Figure 5. Plot of Large (Grouped) and Small (Individual)
Read-Modify-Writes per second per disk and useable storage
capacity for all five levels of RAID (D=100, G=10). We
assume a single S factor uniformly for all levels with S=1.3
where it is needed.



Table VII. Comparison of IBM 3380 disk model AK4 to Level 5 RAID using
100 Conners & Associates CP 3100s disks and a group size of 10, and a comparison
of the Fujitsu M2361A "Super Eagle" to a level 5 RAID using 10 inexpensive data
disks with a group size of 10. Numbers greater than 1 in the comparison columns
favor the RAID.
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Appendix Reliability Calculation 

Using probability theory we can calculate the MTTF_Group. We first
assume independent and exponential failure rates. Our model uses a biased
coin with the probability of heads being the probability that a second
failure will occur within the MTTR of a first failure. Since disk failures
are exponential:

\[ \Pr(\text{at least one of the remaining disks failing in MTTR})
   = 1 - \left( e^{-MTTR/MTTF_{Disk}} \right)^{(G+C-1)} \]

In all practical cases

\[ MTTR \ll \frac{MTTF_{Disk}}{G+C} \]

and since $(1 - e^{-X})$ is approximately $X$ for $0 < X \ll 1$:

\[ \Pr(\text{at least one of the remaining disks failing in MTTR})
   \approx \frac{MTTR \cdot (G+C-1)}{MTTF_{Disk}} \]

Then on a disk failure we flip this coin:

heads => a system crash, because a second failure occurs before the
         first was repaired;
tails => recover from error and continue.

Then

\[ MTTF_{Group}
   = \text{Expected[Time between Failures]} \times \text{Expected[no. of flips until first heads]}
   = \frac{\text{Expected[Time between Failures]}}{\Pr(\text{heads})}
   = \frac{MTTF_{Disk}}{(G+C) \cdot \left( MTTR \cdot (G+C-1) / MTTF_{Disk} \right)} \]

\[ MTTF_{Group} = \frac{(MTTF_{Disk})^2}{(G+C) \cdot (G+C-1) \cdot MTTR} \]

Group failure is not precisely exponential in our model, but we have
validated this simplifying assumption for practical cases of MTTR <<
MTTF/(G+C). This makes the MTTF of the whole system just
MTTF_Group divided by the number of groups, n_G.
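The closed-form result can be evaluated numerically. In the sketch below the disk MTTF of 30,000 hours and the MTTR of 1 hour are assumed inputs that do not appear in this appendix; with D=100 data disks they give system values close to the MTTF figures listed in Table IV. Function names are illustrative.

```python
import math


def p_second_failure_exact(mttr, mttf_disk, g, c):
    """1 - (e^(-MTTR/MTTF_Disk))^(G+C-1): some remaining disk fails within MTTR."""
    return 1 - math.exp(-mttr / mttf_disk) ** (g + c - 1)


def p_second_failure_approx(mttr, mttf_disk, g, c):
    """First-order approximation MTTR*(G+C-1)/MTTF_Disk."""
    return mttr * (g + c - 1) / mttf_disk


def mttf_group(mttf_disk, mttr, g, c):
    """MTTF_Group = MTTF_Disk^2 / ((G+C) * (G+C-1) * MTTR)."""
    return mttf_disk ** 2 / ((g + c) * (g + c - 1) * mttr)


def mttf_system(mttf_disk, mttr, g, c, n_groups):
    """System MTTF is the group MTTF divided by the number of groups."""
    return mttf_group(mttf_disk, mttr, g, c) / n_groups


# Assumed inputs (not given in this appendix): 30,000-hour disks, 1-hour repair.
MTTF_DISK, MTTR, D = 30_000, 1, 100

for g in (10, 25):
    hours = mttf_system(MTTF_DISK, MTTR, g, 1, D // g)   # level 3: C = 1
    print(f"G={g}: about {hours:,.0f} hours")
```

Under these assumptions the level 3 system MTTF comes out near 818,000 hours for G=10 and 346,000 hours for G=25, and the exact and approximate coin probabilities differ by well under 0.1%.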

[Attached reference (5): Japanese Patent Publication No. 2000-293314.
The Japanese-language text of paragraphs [0016] through [0050] and the
list of reference numerals is illegible in this copy. Legible fragments
refer to RAID (Redundant Array of Inexpensive Disks) groups RANK0 (200),
RANK1 (201), and RANKn (202); to logical units LU0 (210) through
LU4 (214); and to a LAN port (140) and an RS232C port (150).]